\documentclass[a4paper,%
11pt,%
DIV14,
headsepline,%
headings=normal,
]{scrartcl}
%\usepackage{ngerman}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{textcomp}

% for matlab code
% bw = blackwhite - optimized for print, otherwise source is colored
%\usepackage[framed,numbered,bw]{mcode}

% for other code
%\usepackage{listings}

%\usepackage[ansinew]{inputenc}

\usepackage{color}
\usepackage{hyperref}
\usepackage{graphicx}
\usepackage{listings}
\usepackage[hang, font = footnotesize]{caption}
%\usepackage{german}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{enumitem}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{mathrsfs}
\usepackage{mathalpha}
\usepackage{wrapfig}
\lstset{
tabsize = 4, %% set tab space width
showstringspaces = false, %% prevent space marking in strings, string is defined as the text that is generally printed directly to the console
numbers = left, %% display line numbers on the left
commentstyle = \color{red}, %% set comment color
keywordstyle = \color{blue}, %% set keyword color
stringstyle = \color{red}, %% set string color
rulecolor = \color{black}, %% set frame color to avoid being affected by text color
basicstyle = \small \ttfamily , %% set listing font and size
breaklines = true, %% enable line breaking
numberstyle = \tiny,
}

\renewcommand{\thesubsection}{Ex~\arabic{section}.\arabic{subsection}}

\lstset{basicstyle=\footnotesize\ttfamily,breaklines=true}

\begin{document}



\hrule height 1px
\vspace*{1ex}
\begin{minipage}[t]{.45\linewidth}
\strut\vspace*{-\baselineskip}\newline
\includegraphics[height=.9cm]{res/Inf-Logo_black_en.pdf}
\includegraphics[height=.9cm]{res/par-logo.pdf}
\end{minipage}
\hfill
\begin{minipage}[t]{.5\linewidth}
\flushright{
Research Group for Parallel Computing\\%
Faculty of Informatics\\%
TU Wien}
\end{minipage}
\vspace*{1ex}

\hrule 

\vspace*{2ex}

\begin{center}
{\LARGE\textbf{Advanced Multiprocessor Programming}}\\
{\large{}%
	Summer term 2023 \\
	Theory exercise 1 of 2 \\
\vspace*{2ex}
Norbert Tremurici (11907086)

%\semestershort\\
%\assignmentname{0}}
}
\end{center}
\hfill
\begin{minipage}[b]{1\linewidth}
\hfill
\begin{tabular}{@{}ll@{}}
Issue date: & 2023-03-27\\
Due date:   & \color{red}{2023-04-24 (23:59)}
\end{tabular}
\end{minipage}

\vspace*{3ex}
\hrule 
\vspace*{1ex}

\begin{flushright}
	\textbf{\underline{Total points: 38 (+ 2 bonus)}}
\end{flushright}

%===================================
\section{Amdahl's Law} \label{sec:amdahl}
\subsection{(2 points)} \label{ex:6_1}
Suppose $98\%$ of a program's execution time is perfectly parallelizable.
\begin{enumerate}[label = \alph*)]
	\item What is the overall speedup that can be achieved by running said program on a machine with 96 processing cores?

\textbf{Solution} Amdahl's Law gives us the achievable speedup by plugging in the fraction of parallelizable work $p = 0.98$ and core count $n = 96$.

$S = \frac{1}{1-p+p/n} = \frac{1}{0.02 + 0.98 / 96} = \frac{9600}{96 \cdot 2 + 98} = \frac{9600}{290} \approx 33.103$

	\item Provide a tight upper limit for the overall speedup that can be achieved for this program (on any machine).

\textbf{Solution} Because a speedup is achieved by parallelizing the parallelizable workload on as many cores as possible, we obtain a tight upper limit for the overall speedup by taking the limit of the core count $n$ towards infinity.

$\hat S = \lim\limits_{n \to \infty} \frac{1}{1-p+p/n} = \frac{1}{1-p} = \frac{1}{0.02} = 50$

Conceptually, even if the parallel workload is infinitely sped up, the execution time is limited by the sequential fraction.

\end{enumerate}
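Both numbers can be sanity-checked with a few lines of Java (a throwaway sketch; the class and method names are ours):

\begin{lstlisting}[language = Java]
public class AmdahlCheck {
	// Amdahl's Law: S = 1 / (1 - p + p/n)
	static double amdahl(double p, double n) {
		return 1.0 / (1.0 - p + p / n);
	}

	public static void main(String[] args) {
		System.out.printf("S(96) = %.3f%n", amdahl(0.98, 96)); // 33.103
		System.out.printf("limit = %.1f%n", 1.0 / (1.0 - 0.98)); // 50.0
	}
}
\end{lstlisting}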
\subsection{(2 points)} \label{ex:6_2}
The company you work for is about to upgrade all of their servers to slower but more energy efficient processors with much higher core counts.
Suppose the sequential part of the main program running on these servers accounts for $20\%$ of the program's computation time.
In order to compensate for the slower processing speeds and to better utilize the higher core count, management asks you to optimize the sequential part of said program --- make it run $k$ times faster ---, s.t. the revised program scales 3 times better with a given number of processing cores meaning the \emph{relative (!)} speedup of the revised program is 3 times higher than that of the original program.
What value of $k$ should you require for a given number of processor cores $n$?

\textbf{Solution} We can solve this problem by considering two speedups $S_1$ and $S_2$ with a parallelizable fraction $p = 0.8$, additionally requiring $S_2$ to be 3 times as high as $S_1$ and for $S_2$ to have a factor of $1/k$ in the sequential fraction

$$
\begin{aligned}
S_1 &= \frac{1}{0.2 + 0.8/n} \\
S_2 &= \frac{0.2/k + 0.8}{0.2/k + 0.8/n} = 3 S_1
\end{aligned}
$$

We can then solve for $k$

$$
\begin{aligned}
\frac{3}{0.2 + 0.8/n} &= \frac{0.2/k + 0.8}{0.2/k + 0.8/n} \\
\frac{30n}{2n + 8} &= \frac{2n + 8kn}{2n + 8k} \\
\frac{30}{2n + 8} &= \frac{2 + 8k}{2n + 8k} \\
30(2n + 8k) &= (2 + 8k)(2n + 8) \\
60(n + 4k) &= 4n + 16 + 16kn + 64k \\
15n + 60k &= n + 4 + k(4n + 16) \\
k(60 - 4n - 16) &= n + 4 - 15n \\
k &= \frac{4 - 14n}{44 - 4n} \\
k &= \frac{1 - 3.5n}{11 - n} \\
\end{aligned}
$$

Do note that for $n \leq 11$, it is impossible to achieve the desired result because $k$ is either infinite or negative.
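The formula can be checked numerically; for example, for $n = 20$ it yields $k = \frac{1 - 70}{11 - 20} = \frac{69}{9} \approx 7.67$, and plugging this back in indeed triples the relative speedup (a quick sketch; names are ours):

\begin{lstlisting}[language = Java]
public class SequentialOptCheck {
	public static void main(String[] args) {
		double n = 20;
		double k = (1 - 3.5 * n) / (11 - n); // derived formula
		double s1 = 1 / (0.2 + 0.8 / n); // original relative speedup
		double s2 = (0.2 / k + 0.8) / (0.2 / k + 0.8 / n); // revised program
		System.out.printf("k = %.3f, s2 / s1 = %.3f%n", k, s2 / s1);
	}
}
\end{lstlisting}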

\subsection{(2 points + 2 bonus points)} \label{ex:6_3}
Let $p = 0.7$ denote the perfectly parallelizable fraction of a given program.
Suppose there are two \emph{mutually exclusive} optimizations to be done on this program, meaning if one optimization is done, the other is no longer available:
\begin{enumerate}[label={(\Roman*)}]
	\item Improve the execution time of the parallelizable part $p$ by a factor of 7. \label{ex:6_3_it1}
	\item Improve the execution time of the sequential part $1-p$ by a factor of 3. \label{ex:6_3_it2}
\end{enumerate}
\begin{enumerate}[label = \alph*)]
	\item Provide the relative speedup $s(n)$ of the original program, as well as of the program after optimization \ref{ex:6_3_it1} $s^{\ref{ex:6_3_it1}}(n)$ and the one after \ref{ex:6_3_it2} $s^{\ref{ex:6_3_it2}}(n)$ for general processor core count $n$ and for $n = 10$.
		Calculate the execution times $T^{\ref{ex:6_3_it1}}_n$ and $T^{\ref{ex:6_3_it2}}_n$ (of the two optimized programs) for $n=10$ in terms of the original workload normalized to 1.
		(\emph{Hint: $T^{\ref{ex:6_3_it1}}_1 = 0.4$ and $T^{\ref{ex:6_3_it2}}_1 = 0.8$})

With the sequential time normalized to 1, the relative speedup $s(n)$ is given by

$$s(n) = \frac{T_1}{T_p} = \frac{0.3 + 0.7}{0.3 + 0.7/n} = \frac{10n}{3n + 7}$$

The optimizations resolve to

$$s^{(I)}(n) = \frac{0.3 + 0.7/7}{0.3 + 0.7/7n} = \frac{0.4}{0.3 + 0.1/n} = \frac{4n}{3n + 1}$$

$$s^{(II)}(n) = \frac{0.3/3 + 0.7}{0.3/3 + 0.7/n} = \frac{0.8}{0.1 + 0.7/n} = \frac{8n}{n + 7}$$

For $n = 10$, we get $s(10) = \frac{100}{37} \approx 2.703$, $s^{(I)}(10) = \frac{40}{31} \approx 1.29$ and $s^{(II)}(10) = \frac{80}{17} \approx 4.706$.

Using the hint $T^{\ref{ex:6_3_it1}}_1 = 0.4$ and $T^{\ref{ex:6_3_it2}}_1 = 0.8$ (with the original workload normalized to 1), the new execution times are $T_{10}^{(I)} = T_{1}^{(I)} / s^{(I)}(10) = 0.4 \cdot \frac{31}{40} = 0.31$ and $T_{10}^{(II)} = T_{1}^{(II)} / s^{(II)}(10) = 0.8 \cdot \frac{17}{80} = 0.17$.

	\item Find a tight bound $n'$ for the processing core count s.t. for $n \ge n'$ optimization \ref{ex:6_3_it2} results in \emph{lower execution times ($\ne$ speedup)} than optimization \ref{ex:6_3_it1}.
		Provide your starting inequality!

\textbf{Solution} We start out with the inequality expressing the desired situation

$$
\begin{aligned}
T_{n}^{(I)} &\geq T_{n}^{(II)} \\
0.3 + \frac{0.7}{7n} &\geq \frac{0.3}{3} + \frac{0.7}{n} \\
3 + \frac{1}{n} &\geq 1 + \frac{7}{n} \\
2 &\geq \frac{6}{n} \\
n &\geq 3 \\
\end{aligned}
$$

We get the result that for $n \geq 3$, optimization $(II)$ results in an execution time no higher than that of $(I)$ (equal at $n = 3$, strictly lower for $n > 3$), so $n' = 3$.
		
	\item \emph{[2 bonus points]} Explain why optimization \ref{ex:6_3_it1} results in a lower relative speedup than even the original (unoptimized) program.
		What is the intuition and potential pitfall behind relative speedup?

\textbf{Solution} Increasing the core count only reduces the execution time of the parallelizable fraction.
Optimization $(I)$ shrinks exactly that fraction by a factor of 7, which lowers the absolute execution time, but it also leaves less parallelizable work to be sped up: the revised program scales worse with the core count $n$, so its \emph{relative} speedup is lower than that of the original program.
The pitfall is that relative speedup compares each program only against its own single-core execution time; a program can have a worse relative speedup and still be absolutely faster, as $T^{(I)}_{10} = 0.31 < 0.37 = T_{10}$ shows.


\end{enumerate}
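The numbers from a) and the bound from b) can be cross-checked with a small sketch (helper names are ours):

\begin{lstlisting}[language = Java]
public class MutuallyExclusiveOpts {
	static double tOpt1(double n) { return 0.3 + 0.1 / n; } // T_n after (I)
	static double tOpt2(double n) { return 0.1 + 0.7 / n; } // T_n after (II)

	public static void main(String[] args) {
		System.out.printf("T_10^(I) = %.2f%n", tOpt1(10)); // 0.31
		System.out.printf("T_10^(II) = %.2f%n", tOpt2(10)); // 0.17
		System.out.println(tOpt2(2) > tOpt1(2)); // (I) still ahead at n = 2
		System.out.println(tOpt2(4) < tOpt1(4)); // (II) ahead from n' = 3 on
	}
}
\end{lstlisting}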

% This (type of) exercise is a total pain to correct and doesn't really contribute much to students' understanding.
% Hence I cut it
%\subsection{(1 point)} \label{ex:7}
%Running your application on two processors yields a speedup of $S_2$.
%Use Amdahl's law to derive a formula for $S_n$, the speedup on $n$ processors, in terms of $n$ and $S_2$.

\subsection{(2 points)} \label{ex:8}
You have a choice between buying one uniprocessor that executes ten billion instructions per second, or a twenty-processor multiprocessor where each processor executes one billion instructions per second.
Explain how you would decide which to buy for a particular application.

\textbf{Solution} $CPU_U$ has $n = 1$ and processes $10^{10}$ instructions/s/core, while $CPU_M$ has $n = 20$ and processes $10^9$ instructions/s/core.
We get general execution times $T^U = 0.1$ and $T^M = 1 - p + \frac{p}{20}$ (with the total workload normalized so that a single 1-billion-instructions/s core needs time 1).

If we want to find out at which point either processor becomes more viable, we need to consider at which parallel fraction $p$ the multiprocessor CPU becomes faster, if any. We can set their execution times equal (effectively considering the point at which one processor overtakes the other) and see which parallel fraction $p$ results from that.

$$
\begin{aligned}
T^U &= T^M \\
0.1 &= 1 - p + \frac{p}{20} \\
2 &= 20 - 20p + p \\
19p &= 18 \\
p &= \frac{18}{19} \approx 0.947
\end{aligned}
$$
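Numerically, the break-even point behaves as expected: below $p = 18/19 \approx 94.7\%$ the uniprocessor is the better buy, above it the multiprocessor wins (a quick sketch; names are ours):

\begin{lstlisting}[language = Java]
public class BreakEven {
	// times with the workload normalized to 1 for a 1-billion-ips core
	static double tUni() { return 0.1; } // single core, but 10x faster
	static double tMulti(double p) { return (1 - p) + p / 20; }

	public static void main(String[] args) {
		System.out.println(tMulti(0.90) > tUni()); // true: uniprocessor wins
		System.out.println(tMulti(0.99) < tUni()); // true: multiprocessor wins
		System.out.printf("T_M(18/19) = %.3f%n", tMulti(18.0 / 19.0)); // 0.100
	}
}
\end{lstlisting}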

\newpage

\section{Locks} \label{sec:locks}


%\subsection{(2 points)} \label{ex:9}
%Define $r$-bounded waiting for a given mutual exclusion algorithm to mean that if $D_A^j \rightarrow D_B^k$ then $CS_A^j \rightarrow CS_B^{k+r}$.
%Is there a way to define a doorway for the Peterson algorithm such that it provides $r$-bounded waiting for some value of $r$?
%If yes, define a doorway (a code interval in Figure \ref{fig:peterson_lock}) and prove the statement for a specific $r$.
%Otherwise sketch an impossibility proof.
%
%\begin{figure}[H]
%	\centering \includegraphics[width=9cm]{res/Peterson Lock.jpg}
%	\caption{Peterson Lock} \label{fig:peterson_lock}
%\end{figure}

\subsection{(2 points)} \label{ex:10}
Why do we need to define a doorway section and why cannot we define FCFS in a mutual exclusion algorithm based on the order in which the first instruction in the $lock()$ method was executed?
Argue your answer in a case-by-case manner based on the nature of the first instruction executed by $lock()$: a read or write, to separate or the same location.

\textbf{Solution} We can consider two threads A and B and exhaust all possible cases:

\begin{itemize}
	\item A \textit{read} instruction by A and B is performed on the \textit{same memory location}. Reads leave no trace in memory, so neither thread can tell which read was performed first; without this information FCFS cannot be defined.
	\item A \textit{read} instruction by A and B is performed on \textit{different memory locations}. This case suffers from the same problem as above.
	\item A \textit{write} instruction by A and B is performed on the \textit{same memory location}. Here the second write overwrites the first, so the later thread cannot tell whether another thread came before it.
	\item A \textit{write} instruction by A and B is performed on \textit{different memory locations}. Because the writes touch different locations, neither thread observes the other's write, and again it is impossible to tell who went first.
\end{itemize}

We can see that in all cases, we cannot uphold a fair order for both threads because of a lack of information to both threads, regardless of the first instruction performed and its order. Thus we cannot define FCFS in a mutual exclusion algorithm this way.

\subsection{(4 points)} \label{ex:11}
Programmers at the Flaky Computer Corporation designed the protocol shown in Figure \ref{fig:flaky_lock} to achieve $n$-thread mutual exclusion.
For each question either sketch a proof or display an execution, where it fails.
\\
\begin{minipage}{\textwidth}
	\begin{minipage}{0.5\textwidth}
		\begin{enumerate}[label=\alph*)]
			\item Does this protocol satisfy mutual exclusion?
			\item Is this protocol starvation-free?
			\item Is this protocol deadlock-free?
		\end{enumerate}
	\end{minipage}
	\begin{minipage}{0.5\textwidth}
		\begin{figure}[H]
			\includegraphics[width=0.7\textwidth]{res/Flaky Lock.jpg}
			\caption{Flaky Lock} \label{fig:flaky_lock}
		\end{figure}
	\end{minipage}
\end{minipage}

\textbf{mutually exclusive}

Yes, we can prove that this protocol satisfies mutual exclusion by sketching a proof by contradiction.

\textbf{Proof} Assume two threads have entered the critical section at the same time, that is to say, two threads A and B have performed a \texttt{read} operation on \texttt{turn} and exited the while loop on line number 11.

The sequence of events (without loss of generality) must have been:

$$
\begin{aligned}
write_B(turn = B) &\to write_A(turn = A) \Rightarrow \\
read_B(busy == false) &\to read_A(busy == false) \Rightarrow \\
write_B(busy = true) &\to write_A(busy = true) \Rightarrow \\
read_B(turn == B) &\to read_A(turn == A)
\end{aligned}
$$

Here (in this example) events separated by the precedence operator $\to$ may be swapped, while events separated by the strict precedence operator $\Rightarrow$ may not, for the sake of this argument.
No such sequence is possible: both A and B require a read of \texttt{turn} holding their respective value, but between setting this value and reading it back, there must have been a read and a write of \texttt{busy}.
Only one thread can successfully set and keep \texttt{turn} and pass the \texttt{busy} check.

\textbf{not starvation-free}

We can construct an execution with two threads A and B in which one thread enters the critical section infinitely many times while the other starves.

$$
\begin{aligned}
write_B(turn = B) &\to write_A(turn = A) \to read_A(busy == false) \to \\
write_A(busy = true) &\to read_A(turn == A)
\end{aligned}
$$

At this point, A exits \texttt{lock()}, B keeps trying.

$$
read_B(busy == true) \to write_B(turn = B) \to write_A(busy = false)
$$

Now A exits \texttt{unlock()}, enters \texttt{lock()} again.

$$
\begin{aligned}
write_A(turn = A) &\to read_A(busy == false) \to \\
write_A(busy = true) &\to read_A(turn == A)
\end{aligned}
$$

And A has successfully locked another time and exits \texttt{lock()}.

We can repeat this execution in this way indefinitely.

\textbf{not deadlock-free}

We can construct an indefinite execution where two threads A and B prevent each other from ever acquiring the lock.

$$
\begin{aligned}
write_A(turn = A) &\to write_B(turn = B) \to read_A(busy == false) \to write_A(busy = true) \to \\
read_B(busy == true) &\to read_A(turn == B)
\end{aligned}
$$

At this point both A and B are back at line 8 (where they started); since \texttt{busy} is still set and neither thread can reset it, both spin in the inner loop forever and no thread ever acquires the lock.
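For reference, the protocol as we transcribe it from Figure \ref{fig:flaky_lock} (the transcription is ours and should be treated as an assumption; the thread id is passed in explicitly and the fields are made \texttt{volatile} to keep the sketch self-contained and faithful on a real JVM):

\begin{lstlisting}[language = Java]
class Flaky {
	private volatile int turn;
	private volatile boolean busy = false;

	public void lock(int me) {
		do {
			do {
				turn = me; // announce ourselves
			} while (busy);
			busy = true;
		} while (turn != me); // did we keep the turn?
	}

	public void unlock(int me) {
		busy = false;
	}
}
\end{lstlisting}

With this transcription, the line numbers cited above would correspond to the inner-loop write of \texttt{turn} and the outer-loop check of \texttt{turn}.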

%\subsection{(2 points)} \label{ex:12}
%Show that the Filter lock allows some threads to overtake others (who are at least attempting level 1, otherwise the statement is trivial) an unbounded number of times.

\subsection{(10 points)} \label{ex:13}
Another way to generalize the two-thread Peterson lock seen in Figure \ref{fig:peterson_lock} is to arrange a number of 2-thread Peterson locks in a binary tree.
Suppose $n$ is a power of two.
Each thread is assigned a leaf lock which it shares with one other thread.
Each lock treats one thread as thread 0 and the other as thread 1.

In the tree-lock's acquire method, the thread acquires every two-thread Peterson lock from that thread's leaf to the root.
The tree-lock's release method for the tree-lock unlocks each of the 2-thread Peterson locks that thread has acquired, \emph{starting from the leaf up to the root}.
At any time, a thread can be delayed for a finite duration.
(In other words, threads can take naps, or even vacations, but they do not drop dead.)
For each of the first three listed properties below, either sketch a formal induction proof that it holds, or describe a (possibly infinite) execution, where it is violated.
In case of a violation: can you find a (simple) fix to make it work?
If so, provide the necessary changes and sketch an induction proof that your fix really works.


\begin{figure}[H]
	\centering \includegraphics[width=0.47\textwidth]{res/Peterson Lock.jpg}
	\caption{Peterson Lock} \label{fig:peterson_lock}
\end{figure}

\begin{enumerate}[label=\alph*)]
	\item Mutual exclusion

\textbf{Solution} Mutual exclusion does not hold, because of the way the tree is unlocked.
We can construct an example, starting from a state where one thread has currently locked all its stages, with threads from the left and right subtrees attempting to acquire their locks.

Assume there are three threads A, B and C and at least three lock nodes N, E and W (north, east and west).
N is the root node, W is of the left subtree and E is of the right subtree.
Threads A and B are leaves of W, C is a leaf of E.
Without loss of generality, assume A has locked N (and thus has also locked W).

The states of the nodes are, $flag_W = [true, true]$, $victim_W = 1$, $flag_E = [true, false]$, $victim_E = 0$, $flag_N = [true, true]$, $victim_N = 1$.
That is to say, B is trying to lock W and C has locked E and is trying to lock N, A has locked W and N.

We can now construct a sequence of events that violates mutual exclusion as A tries to unlock from the leaf up.

$$
\begin{aligned}
write_A(flag_W[0] = false) &\to read_B(flag_W[0] == false) \to \\
write_B(flag_N[0] = true) &\to write_B(victim_N = 0) \to \\
write_C(flag_N[1] = true) &\to write_C(victim_N = 1) \to \\
write_A(flag_N[0] = false) &\to read_B(victim_N == 1) \to \\
read_C(flag_N[0] == false)
\end{aligned}
$$

We can now see that C has read $flag_N[0]$ to be $false$ (thus entering the critical section) and B has read $victim_N$ to be 1 (thus entering the critical section as well).
Mutual exclusion is violated.

The problem stems from the fact that in the part of the tree consisting of nodes N, W and E, there can be three threads simultaneously.
In our example, B and C are waiting for N while A occupies and unlocks N.
While A is busy going up the tree and unlocking the locks, another thread from the left subtree can move up along and eventually contest the root lock with a thread from the right subtree.

A \textbf{simple fix} would be to change the way the tree lock is unlocked:
the thread unlocks all locks starting from the root node down to the leaf.
Now the situation above, where one thread holds the root node while two other threads contest it, cannot happen: A releases the root first while still holding the locks below it, blocking any thread from its own subtree until those locks are released as well.

We can sketch an inductive proof that mutual exclusion is satisfied after this fix.

\textbf{Induction hypothesis} The tree-lock satisfies mutual exclusion for $n$ threads.

\textbf{Base case} For $n = 2$, there is only a single Peterson lock, which we know to satisfy mutual exclusion.

\textbf{Induction step} For the induction step, we grow the tree by another layer, $n' = 2n$ (subtrees handle $n$ threads).
By the induction hypothesis, we know that each subtree guarantees mutual exclusion and thus there can be only one thread at the root node of each subtree.

We now consider all cases, either the new root node is locked or unlocked.

If it is unlocked, the contest can only be between two threads, one from each subtree and the situation resembles a regular Peterson lock.

If it is locked, then one subtree must have its root node locked by the same thread that occupies the new root node, so there can be at most one thread from the other subtree attempting to acquire the lock.
The problematic scenario described previously cannot happen, because A now unlocks the new root node first.
A has to release two locks before a thread from its own subtree can move up; by then either the thread from the other subtree has acquired the new root node's lock, or we reach a situation in which two threads contest the new root node while it is unlocked, which again resembles a regular Peterson lock.

Thus after the induction step the induction hypothesis still holds for $n' = 2n$ threads.

	\item Deadlock freedom

\textbf{Solution} The tree-lock is deadlock-free.

\textbf{Induction hypothesis} The tree-lock is deadlock-free for $n$ threads.

\textbf{Base case} For $n = 2$, the tree-lock consists of a single Peterson lock, which we know to be deadlock-free.

\textbf{Induction step} Growing the tree by another layer, $n' = 2n$, we need to consider whether at any layer all threads can deadlock.
By our induction hypothesis, we know that not all threads of both subtrees can become deadlocked, so we only consider the new root node.
If the root node is currently locked, then it will be unlocked after a finite amount of time, upon which a thread coming from the other subtree can succeed.
If the root node is not currently locked, then there can be a contest between at most two threads coming from the left and right subtrees.
This situation resembles a regular Peterson lock, which we know to be deadlock-free.

Thus the induction hypothesis still holds for $n' = 2n$ threads.

	\item Starvation freedom

\textbf{Solution} The tree-lock is starvation-free; the induction argument is analogous to the one for deadlock freedom, additionally using the fact that each two-thread Peterson lock is itself starvation-free: a thread waiting at any node eventually acquires that node and moves up one layer.
	
	\item Is there an upper bound on the number of times the tree-lock can be acquired and released between the time a thread starts acquiring the tree-lock and when it succeeds?
		If so, sketch a proof, otherwise construct an unbounded execution.

\textbf{Solution} Yes, we can place an upper bound on the number of times the tree-lock can be acquired and released between the time a thread starts acquiring the tree-lock and the time that it succeeds.
This is because the tree-lock is unlocked starting not from the root, but from the leaf.

For a thread (call it A) to be overtaken an unbounded number of times, A would have to wait at some layer $l$ for a lock to be released while, at the root layer $h > l$, threads from the other subtree acquire and release the tree-lock an unbounded number of times.
This is impossible, because the lock at layer $h$ is released strictly after the lock at layer $l$: since unlocking starts at the leaf, the lock blocking A at layer $l$ becomes free before the root node does, unblocking A.
A may then be blocked by another thread of its own subtree, but as we have seen, the tree-lock is starvation-free, so A cannot be blocked indefinitely; and if the releasing thread naps or takes a vacation before having unlocked the root node, no one at all can acquire the tree-lock in the meantime.
We have also seen that this lock does not guarantee mutual exclusion, so two threads may overtake A at layer $l$ at once (if $l$ is not the lowermost layer); but the worst such a violation can do is bring A to the root lock faster than it would otherwise get there.
In the worst case A is only slowly shifted up the tree, so a conservative upper bound such as $10^h$, where $h$ is the height of the tree, is safe.

\end{enumerate}
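The fix from a) can be sketched in Java (the sketch is ours: explicit thread ids replace \texttt{ThreadID.get()}, \texttt{volatile} fields stand in for atomic registers, and each thread passes scratch arrays that record its acquisition path):

\begin{lstlisting}[language = Java]
class Peterson2 {
	volatile boolean flag0, flag1; // interest flags of sides 0 and 1
	volatile int victim;

	void lock(int i) {
		if (i == 0) flag0 = true; else flag1 = true;
		victim = i;
		// spin while the other side is interested and we are the victim
		while ((i == 0 ? flag1 : flag0) && victim == i) { }
	}

	void unlock(int i) {
		if (i == 0) flag0 = false; else flag1 = false;
	}
}

class TreeLock {
	final int n, depth; // n threads (a power of two), depth = log2(n)
	final Peterson2[] nodes; // heap layout, nodes[1] is the root lock

	TreeLock(int n) {
		this.n = n;
		this.depth = Integer.numberOfTrailingZeros(n);
		nodes = new Peterson2[n];
		for (int j = 1; j < n; j++) nodes[j] = new Peterson2();
	}

	// acquire every 2-thread lock from thread t's leaf up to the root,
	// recording the visited nodes and sides for the release
	void lock(int t, int[] path, int[] ids) {
		int node = (n + t) / 2, id = t % 2;
		for (int d = 0; d < depth; d++) {
			nodes[node].lock(id);
			path[d] = node; ids[d] = id;
			id = node % 2; node /= 2;
		}
	}

	// the fix: release starting from the root down to the leaf
	void unlock(int t, int[] path, int[] ids) {
		for (int d = depth - 1; d >= 0; d--)
			nodes[path[d]].unlock(ids[d]);
	}
}
\end{lstlisting}

With the original leaf-to-root release, the counterexample from a) applies; with the release order above, several threads hammering the lock keep a shared counter consistent.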

\subsection{(4 points)} \label{ex:14}
The $\mathfrak{L}$-exclusion problem is a variant of the starvation-free mutual exclusion problem.
We make two changes: as many as $\mathfrak{L}$ threads may be in the critical section at the same time, and fewer than $\mathfrak{L}$ threads might fail (by halting) in the critical section.

An implementation must satisfy the following conditions:
\begin{itemize}
	\item $\mathfrak{L}$\textbf{-exclusion:} at any time at most $\mathfrak{L}$ threads are in the critical section.
	\item $\mathfrak{L}$\textbf{-starvation-freedom:} as long as fewer than $\mathfrak{L}$ threads are in the critical section, some thread that wants to enter the critical section will eventually succeed (even if some threads in the critical section have halted).
\end{itemize}

Modify the $n$-process Filter mutual exclusion algorithm from Figure \ref{fig:filter_lock} to turn it into an $\mathfrak{L}$-exclusion algorithm.
Provide the whole source code!
\begin{figure}[H]
	\centering \includegraphics[width=10cm]{res/Filter Lock.jpg}
	\caption{Filter Lock} \label{fig:filter_lock}
\end{figure}

\textbf{Solution} The filter lock works by filtering out one thread at each level; since there are $n - 1$ levels to get through, $n - 1$ threads are filtered and only 1 thread reaches the critical section.

The first change is then simple and obvious: because we want to let $l$ threads pass, we eliminate $l$ of the levels, which makes it possible for $l$ threads to get past instead of 1.
The new number of levels is $n - l - 1$.

A second change is necessary for the while loop that spins on the condition that there is a thread on higher levels.
If there is a thread in the critical section, then all threads are blocked in this while loop, even though we require it to be possible for $l$ threads to proceed.

The fix is to change this check into a different kind of check:
from the perspective of our thread, we count all threads that are at a level less than or equal to our own.

If this number is greater than $n - l$, then fewer than $l$ threads are at levels above our own, so at most $l - 1$ threads can be in the critical section; we therefore spin while this number is less than or equal to $n - l$.

The full code can be seen in the listing below.

\begin{lstlisting}[language = Java]
class Filter implements Lock {
	int n, l; // n threads, at most l of them in the critical section
	int[] level;
	int[] victim;

	public Filter(int n, int l) {
		this.n = n;
		this.l = l;
		level = new int[n]; // one entry per thread, initially 0
		victim = new int[n - l]; // one victim per remaining level
	}

	public void lock() {
		int me = ThreadID.get();
		// CHANGE: l filters are cut off
		for (int i = 1; i < n - l; i++) { // attempt level i
			level[me] = i;
			victim[i] = me;
			// CHANGE: count the threads at levels 1 to i
			int hasntPassedMe;
			do {
				hasntPassedMe = 1; // me is not past me
				for (int k = 0; k < n; k++)
					if (k != me && level[k] <= i)
						hasntPassedMe++;
				// spin until the victim changes or more than n - l
				// threads are at levels <= i, which implies that
				// fewer than l threads can be in the critical section
			} while (hasntPassedMe <= n - l && victim[i] == me);
		}
	}

	public void unlock() {
		int me = ThreadID.get();
		level[me] = 0; // through the filter
	}
}
\end{lstlisting}

\subsection{(2 points)} \label{ex:15}
In practice, almost all lock acquisitions are uncontended, so the most practical measure of a lock's performance is the number of steps needed for a thread to acquire a lock when no other thread is concurrently trying to acquire the lock.

Scientists at Cantaloupe-Melon University have devised the following `wrapper' for an arbitrary lock, shown in Figure \ref{fig:fastpath_lock}.
They claim that if the base $Lock$ class provides mutual exclusion and is starvation-free, so does the $FastPath$ lock, but it can be acquired in a constant number of steps in the absence of contention.
Sketch an argument why they are right, or give a counterexample.
\begin{figure}[H]
	\centering \includegraphics[width=10cm]{res/FastPath Lock.jpg}
	\caption{FastPath Lock} \label{fig:fastpath_lock}
\end{figure}

\textbf{Solution} The scientists at Cantaloupe-Melon University have failed their objective: we can give a counterexample to the claim by showing that the $FastPath$ lock does not even satisfy mutual exclusion.

Consider an execution between threads A and B, with $y = -1$ (no other thread has currently acquired the lock).

$$
\begin{aligned}
write_A(x = 0) &\to write_B(x = 1) \to \\
read_A(y == -1) &\to read_B(y == -1) \to \\
write_A(y = 0) &\to write_B(y = 1) \to \\
read_A(x == 1) &\to read_B(x == 1)
\end{aligned}
$$

At this point, both A and B are at the if condition in line 10.
A sees that $x = 1 \neq 0$ (not its own id) and takes the slow path, acquiring the internal lock in the traditional way.
B sees that $x = 1$ is its own id and forsakes the slow path, passing on and thereby successfully acquiring the fast-path lock.

Mutual exclusion is violated and it becomes apparent once more that there are no shortcuts to life!

\section{Register construction} \label{sec:register_con}

\subsection{(2 points)} \label{ex:38_new}
Consider the safe Boolean MRSW construction shown in Figure \ref{fig:safe_bool_mrsw}.
\\
True or false: if we replace the safe Boolean SRSW register array with an array of atomic SRSW registers, then the construction yields an atomic Boolean MRSW register.
Justify your answer by sketching a proof or providing a counterexample.
\begin{figure}[H]
	\centering \includegraphics[width=10cm]{res/Safe Boolean MRSW.jpg}
	\caption{Safe Boolean MRSW construction} \label{fig:safe_bool_mrsw}
\end{figure}

\textbf{Solution} False.

Consider four events $R_1$, $R_2$, $R_3$ and $W$ as reads and writes respectively.
We construct a counterexample by providing the following relations:
$R_1 = read(false)$, $R_2 = read(true)$, $R_3 = read(false)$, $W = write(true)$ with $R_1 \to W$, $R_2 \to R_3$, $R_2 \not \to W$ and $R_3 \not \to W$.
That is to say, the previous value was $false$, the new value $true$ is being written.
Read $R_3$ happens strictly after read $R_2$, and reads $R_2$ and $R_3$ happen concurrently to the write $W$.

\begin{figure}[H]
	\centering \includegraphics[width=12cm]{res/ex-3-1.png}
	\caption{Ex 3.1 counterexample} \label{fig:ex-3-1}
\end{figure}

This situation is illustrated in Figure \ref{fig:ex-3-1}.

Because the MRSW register needs to be atomic, the write and the reads must be linearizable.
But as $R_3$ happens strictly after $R_2$ yet returns the older value, no linearization points for $W$, $R_2$ and $R_3$ can be found, which violates atomicity; for this outcome to be admissible, $R_2$ and $R_3$ would have to overlap.
The execution itself is possible with the construction, because the writer updates the atomic SRSW registers one after another: a reader with a lower thread ID can read after the write to its register has been committed, while a reader with a higher thread ID can still read before the write to its register has been committed.

The answer is no, the construction does not yield an atomic Boolean MRSW register.
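The new-then-old interleaving can be replayed deterministically. Below is a minimal sketch (the names \texttt{table}, \texttt{write\_step} and \texttt{read} are illustrative, not the book's code) in which the writer has updated reader 0's SRSW cell but not yet reader 1's cell when the two reads occur:

\begin{lstlisting}[language=Python]
# One SRSW cell per reader; the old value is False everywhere.
table = [False, False]

def write_step(i, v):
    table[i] = v          # the writer updates reader i's cell

def read(i):
    return table[i]       # reader i only ever reads its own cell

r1 = read(0)              # R1: completes before W starts, sees False
write_step(0, True)       # W in progress: reader 0's cell updated first
r2 = read(0)              # R2: reader 0 already sees the new value True
r3 = read(1)              # R3: reader 1 still sees the old value False
write_step(1, True)       # W completes

print(r1, r2, r3)         # R2 = True strictly before R3 = False
\end{lstlisting}

Since $R_3$ starts only after $R_2$ has returned, the pair $true$-then-$false$ admits no linearization order with $W$.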

%\subsection{(2 points)} \label{ex:35}
%Consider the safe Boolean MRSW construction shown in Figure \ref{fig:safe_bool_mrsw}.
%\\
%True of false: if we replace the safe Boolean SRSW register array with an array of regular Boolean SRSW registers, then the construction yields a regular Boolean MRSW register.
%Justify your answer by sketching a proof or providing a counterexample.

\subsection{(2 points)} \label{ex:36}
Consider the atomic MRSW register construction shown in Figure \ref{fig:atomic_mval_mrsw}.
\\
True or false: if we replace the atomic SRSW registers with regular SRSW registers, then the construction still yields an atomic MRSW register.
Justify your answer by sketching a proof or providing a counterexample.

\begin{figure}[H]
	\centering \includegraphics[width=12cm]{res/Atomic M-val MRSW.jpg}
	\caption{Atomic MRSW construction} \label{fig:atomic_mval_mrsw}
\end{figure}

\textbf{Solution} False.

Consider three events $R_1$, $R_2$ and $W$, with $R_1 = read(\langle 1, A\rangle)$, $R_2 = read(\langle 0, B\rangle)$ and $W = write(\langle 1, A\rangle)$.
In this example, the previously committed value was $\langle 0, B\rangle$.
We have the relation $R_1 \to R_2$, while neither $R_1$ nor $R_2$ is ordered with respect to $W$.
That is to say, $R_2$ happens strictly after $R_1$, and $W$ is concurrent with both $R_1$ and $R_2$.
This is illustrated in Figure \ref{fig:ex-3-2}.

\begin{figure}[H]
	\centering \includegraphics[width=12cm]{res/ex-3-2.png}
	\caption{Ex 3.2 counterexample} \label{fig:ex-3-2}
\end{figure}

The problem is that it is valid for the later read $R_2$ to return the previous value $\langle 0, B\rangle$.
The construction is meant to yield an atomic MRSW register, but the internal registers are now only regular SRSW registers.
In particular, the diagonal register of the reading thread is exactly the one the writer may still be writing, and a regular register whose read overlaps a write may return either the old or the new value on each read, independently.
Hence it is perfectly acceptable for the first read $R_1$ to return the new value while the second read $R_2$ returns the old one.
Normally the construction combats this problem by having readers help propagate values, but readers are not allowed to write the very diagonal register that the writer owns: in the code, the diagonal registers are skipped, as can be seen in line 23 of Figure \ref{fig:atomic_mval_mrsw}.
As in Ex 3.1, this execution is not linearizable, so the construction is not a valid atomic MRSW register.
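The regular-register behaviour at the diagonal can likewise be replayed deterministically. In the sketch below, a scripted \texttt{RegularCell} stands in for the reading thread's diagonal register while the write of $\langle 1, A\rangle$ is pending: a regular register may answer each overlapping read with either the old or the new value, independently per read (all names are illustrative):

\begin{lstlisting}[language=Python]
OLD, NEW = (0, "B"), (1, "A")   # stamped values <timestamp, value>

class RegularCell:
    """A regular register mid-write: each overlapping read may
    independently return the old or the new value."""
    def __init__(self, answers):
        self.answers = list(answers)
    def read(self):
        return self.answers.pop(0)

# Script the diagonal cell to answer "new" first, then "old".
# Readers may not write this cell (the diagonal is skipped when
# helping), so nothing propagates the new value in between.
diagonal = RegularCell([NEW, OLD])

r1 = diagonal.read()   # R1 = <1, A>  (new value)
r2 = diagonal.read()   # R2 = <0, B>  (old value, strictly later)
print(r1, r2)
\end{lstlisting}

Both answers are legal for a regular register, yet the new-then-old order rules out any linearization with $W$.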

%\subsection{(2 points)} \label{ex:34}
%Consider the safe Boolean MRSW construction shown in Figure \ref{fig:safe_bool_mrsw}.
%\\
%True or false: if we replace the safe Boolean SRSW register array with an array of safe $M$-valued registers, then the construction yields a safe $M$-valued MRSW register.
%Justify your answer by sketching a proof or providing a counterexample.

%\subsection{(2 points)} \label{ex:37_new}
%Give an example of a sequentially-consistent execution that is not safe.

%\subsection{(2 points)} \label{ex:39}
%Consider the regular Boolean MRSW construction shown in Figure \ref{fig:regular_bool_mrsw}.
%\\
%True of false: if we replace the safe Boolean MRSW register with a safe $M$-valued MRSW register, then the construction yields a regular $M$-valued MRSW register.
%Justify your answer by sketching a proof or counterexample.
%
%\begin{figure}[H]
%	\centering \includegraphics[width=12cm]{res/Regular Boolean MRSW.jpg}
%	\caption{Regular Boolean MRSW construction} \label{fig:regular_bool_mrsw}
%\end{figure}

\subsection{(4 points)} \label{ex:40_new}
Does Peterson's two-thread mutual exclusion algorithm from Figure \ref{fig:peterson_lock} still work if the shared atomic \texttt{flag} registers are replaced by safe registers?
Argue by either providing a proof sketch or counterexample.

\textbf{Solution} Yes, it does.

To prove this, we can consider what actually changes.
Safe registers behave differently from atomic ones only when a read overlaps a write, so we restrict ourselves to those cases.

For such an overlapping read, an arbitrary value may be returned while the write is in progress.
If it is the value currently being written, that is completely fine.

If it is not the value being written, then it must be the previous value, because the registers are Boolean and hold only one of two values.

We can look at all relevant cases more precisely.
There is only a danger when thread $i$ reads the flag register of thread $j$, because a single thread's reads never overlap its own writes.
The only read to $flag[j]$ happens in the while loop, where the lock spins.
The only writes to $flag[i]$ happen either in \texttt{lock()} during the doorway section, or in \texttt{unlock()}.

If the flag is being written in \texttt{lock()} and read at the same time, then the flag must previously have been $false$ and is being written to $true$, because the writing thread is not in the critical section but about to enter it.
This case is fine: the thread reading the flag (call it thread A) has already finished its doorway, and mutual exclusion is guaranteed because the thread writing the flag (call it thread B) will spin:
Thread A has set its flag to $true$ and completed its write to \texttt{victim}.
Thread B is still writing its flag, so it will write to \texttt{victim} next and then become blocked, spinning in the while loop.

If the flag is being written in \texttt{unlock()} (by thread B) and read at the same time (by thread A), then the flag must previously have been $true$ and is being written to $false$, because the writing thread entered the critical section and is exiting it.
If A reads the old value $true$, it simply continues spinning in the while loop, so this case only causes a delay (which is certain to be overcome once a read in the loop returns the new value).

As we can see, there are no problematic cases and thus the Peterson two-thread mutual exclusion algorithm still works, even if the shared atomic \texttt{flag} registers are replaced by safe registers.
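For reference, here is a minimal executable sketch of Peterson's two-thread algorithm (Python threads stand in for the two-thread setting; CPython's interpreter lock makes the shared accesses sequentially consistent, so it models the atomic-register case). The comments mark the only flag accesses that the case analysis above needs to consider; all names are the usual textbook ones, not code from the figure:

\begin{lstlisting}[language=Python]
import threading
import time

flag = [False, False]   # flag[i]: thread i wants to enter
victim = 0              # the thread that yields on conflict

def lock(i):
    global victim
    j = 1 - i
    flag[i] = True      # doorway: the write of flag[i] in lock()
    victim = i
    while flag[j] and victim == i:   # the only read of flag[j]
        time.sleep(0)   # yield so the other thread can make progress

def unlock(i):
    flag[i] = False     # the write of flag[i] in unlock()

counter = 0

def worker(i, n):
    global counter
    for _ in range(n):
        lock(i)
        counter += 1    # critical section: non-atomic read-modify-write
        unlock(i)

N = 2000
threads = [threading.Thread(target=worker, args=(i, N)) for i in (0, 1)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)          # 2 * N if mutual exclusion held
\end{lstlisting}

A safe-register flicker could only occur at the marked read of \texttt{flag[j]} while one of the marked writes is in progress, and by the case analysis above each such flicker either returns the value being written or merely prolongs the spin.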

%\subsection{(6 points)} \label{ex:41}
%Consider the following implementation of a register in a distributed, message-passing system.
%There are $n$ single-threaded processors $P_0, \ldots, P_{n-1}$ arranged in a ring, where $P_i$ can send messages only to $P_{i+1\ mod\ n}$.
%Messages are delivered in FIFO order along each link.
%Each processor keeps a copy of the shared register.
%\begin{itemize}
%	\item To read a register, the processor reads the copy in its local memory.
%	\item A processor $P_i$ starts a $write()$ call of value $v$ to register $x$, by sending the message `$P_i:\ write\ v\ to\ x$' to $P_{i+1\ mod\ n}$.
%	\item If $P_i$ receives a message `$P_j:\ write\ v\ to\ x$' for $i \ne j$, then it writes $v$ to its local copy of $x$ and forwards the message to $P_{i+1\ mod\ n}$.
%	\item If $P_i$ receives a message `$P_i:\ write\ v\ to\ x$' then it writes $v$ to its local copy of $x$ and discards the message.
%		The $write()$ call is now complete.
%\end{itemize}
%
%For the following questions either provide a short proof sketch or counterexample.
%If write operations do not overlap, ...
%\begin{enumerate}[label=\alph*)]
%	\item is this register implementation regular?
%	\item is it atomic?
%\end{enumerate}
%If multiple processors are allowed to call $write()$ simultaneously, ...
%\begin{enumerate}[label=\alph*)]
%	\setcounter{enumi}{2}
%	\item is this register implementation safe?
%\end{enumerate}
\end{document}
